Add deterministic code and snippet memory identity #181

Open
hunterbastian wants to merge 3 commits into XortexAI:main from hunterbastian:codex-code-snippet-schema

Conversation

@hunterbastian
Contributor

Summary

Implements deterministic identity metadata for code annotations and personal snippets so XMem can avoid re-judging exact code/snippet memories with an LLM.

Changes:

  • add stable Pinecone metadata helpers for snippet identity, snippet search text, code annotation identity keys, and code annotation content hashes
  • route code and snippet memory through deterministic judge paths using metadata lookups
  • store snippet_hash, annotation_key, and annotation_hash in Pinecone metadata
  • keep snippet code exact in metadata while embedding only the searchable description/language/tags text
  • add regression coverage for repeated snippets across sessions and same-target code annotation updates

This addresses the edge case discussed in #141 where a user sends a snippet, then asks for the same snippet in another session. The normalized snippet_hash lets the judge no-op the duplicate without another model call.
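
The cross-session case can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's implementation: `snippet_hash`, the in-memory `store`, and `judge_snippet` are hypothetical stand-ins for the real schema helpers and the Pinecone metadata lookup.

```python
import hashlib

def snippet_hash(code: str) -> str:
    """Hypothetical identity hash: normalize line endings and trailing whitespace, keep case."""
    lines = code.replace("\r\n", "\n").strip().split("\n")
    normalized = "\n".join(line.rstrip() for line in lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Stand-in for stored metadata: (user_id, snippet_hash) -> record id.
store: dict[tuple[str, str], str] = {}

def judge_snippet(user_id: str, code: str) -> str:
    """Return "ADD" for an unseen snippet and "NOOP" for an exact repeat; no model call."""
    key = (user_id, snippet_hash(code))
    if key in store:
        return "NOOP"
    store[key] = f"snippet-{len(store) + 1}"
    return "ADD"

# Session 1 stores the snippet; session 2 sends it again with CRLF line endings.
session_one = judge_snippet("u1", "def add(a, b):\n    return a + b\n")
session_two = judge_snippet("u1", "def add(a, b):\r\n    return a + b")
# A different user sending the same code is still an ADD, because user_id is part of the key.
other_user = judge_snippet("u2", "def add(a, b):\n    return a + b\n")
```

Scoping the lookup by `user_id` as well as the hash mirrors the filter shape used in the judge code quoted later in this thread.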

Verification

  • python3 -m compileall src/schemas/code.py src/agents/judge.py src/pipelines/ingest.py src/pipelines/weaver.py tests/unit/test_schemas.py tests/test_deterministic_memory_layer.py
  • uv run --extra dev pytest tests/unit/test_schemas.py tests/test_deterministic_memory_layer.py -> 12 passed
  • uv run --extra dev pytest -> 44 passed
  • uv run ruff check --select F401 src/schemas/code.py src/agents/judge.py src/pipelines/ingest.py src/pipelines/weaver.py tests/unit/test_schemas.py tests/test_deterministic_memory_layer.py
  • git diff --check

/claim #141

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements deterministic judging for code annotations and snippets, centralizing parsing and metadata generation logic within src/schemas/code.py. The JudgeAgent is updated to handle these new domains by performing metadata lookups to decide between adding, updating, or skipping items. Feedback suggests optimizing the _deterministic_code and _deterministic_snippet methods by deduplicating incoming items and using asyncio.gather for parallel metadata lookups to improve performance and prevent redundant operations.

Comment thread src/agents/judge.py
Comment on lines +496 to +534
    async def _deterministic_code(
        self, new_items: list, user_id: str,
    ) -> JudgeResult:
        operations: list[Operation] = []
        for item in new_items:
            content = str(item)
            fields = code_annotation_fields_from_storage_content(content)
            match = await self._lookup_metadata_match({
                "user_id": user_id,
                "domain": JudgeDomain.CODE.value,
                "annotation_key": code_annotation_identity_key(fields),
            })

            if match is None:
                operations.append(Operation(
                    type=OperationType.ADD,
                    content=content,
                    reason="No code annotation with the same repo/target/type key.",
                ))
                continue

            incoming_hash = code_annotation_content_hash(fields)
            existing_hash = str((match.metadata or {}).get("annotation_hash", ""))
            if incoming_hash == existing_hash:
                operations.append(Operation(
                    type=OperationType.NOOP,
                    content=content,
                    embedding_id=match.id,
                    reason="Existing code annotation is unchanged.",
                ))
            else:
                operations.append(Operation(
                    type=OperationType.UPDATE,
                    content=content,
                    embedding_id=match.id,
                    reason="Existing code annotation target has updated content.",
                ))

        return JudgeResult(operations=operations, confidence=1.0)

medium

The _deterministic_code method processes items sequentially and lacks deduplication of the incoming new_items. If multiple identical annotations are extracted in a single turn, this will result in redundant operations and potential duplicate records in the vector store. It is recommended to deduplicate items by their identity key and use asyncio.gather to perform metadata lookups in parallel, maintaining consistency with the profile and temporal domains.

    async def _deterministic_code(
        self, new_items: list, user_id: str,
    ) -> JudgeResult:
        # Deduplicate items by identity key to prevent redundant operations
        unique_items: dict[str, tuple[str, dict]] = {}
        for item in new_items:
            content = str(item)
            fields = code_annotation_fields_from_storage_content(content)
            key = code_annotation_identity_key(fields)
            unique_items[key] = (content, fields)

        async def _process_one(content: str, fields: dict) -> Operation:
            match = await self._lookup_metadata_match({
                "user_id": user_id,
                "domain": JudgeDomain.CODE.value,
                "annotation_key": code_annotation_identity_key(fields),
            })

            if match is None:
                return Operation(
                    type=OperationType.ADD,
                    content=content,
                    reason="No code annotation with the same repo/target/type key.",
                )

            incoming_hash = code_annotation_content_hash(fields)
            existing_hash = str((match.metadata or {}).get("annotation_hash", ""))
            if incoming_hash == existing_hash:
                return Operation(
                    type=OperationType.NOOP,
                    content=content,
                    embedding_id=match.id,
                    reason="Existing code annotation is unchanged.",
                )
            else:
                return Operation(
                    type=OperationType.UPDATE,
                    content=content,
                    embedding_id=match.id,
                    reason="Existing code annotation target has updated content.",
                )

        tasks = [_process_one(c, f) for c, f in unique_items.values()]
        operations = await asyncio.gather(*tasks)
        return JudgeResult(operations=list(operations), confidence=1.0)
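
The dedupe-then-gather pattern suggested above can be exercised in isolation. The following is a small self-contained sketch, assuming a hypothetical `fake_lookup` coroutine in place of `_lookup_metadata_match` and plain string keys in place of real annotation keys.

```python
import asyncio
from typing import Optional

lookup_calls: list[str] = []

async def fake_lookup(key: str) -> Optional[str]:
    """Hypothetical async metadata lookup; records every call it receives."""
    lookup_calls.append(key)
    await asyncio.sleep(0)  # yield control, as a real I/O call would
    return "existing-id" if key == "repo:a.py:foo" else None

async def judge_batch(keys: list[str]) -> list[tuple[str, str]]:
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    unique = list(dict.fromkeys(keys))
    # asyncio.gather returns results in the same order as the awaitables passed in.
    matches = await asyncio.gather(*(fake_lookup(k) for k in unique))
    return [(k, "ADD" if m is None else "NOOP") for k, m in zip(unique, matches)]

result = asyncio.run(judge_batch(["repo:a.py:foo", "repo:b.py:bar", "repo:a.py:foo"]))
```

Note one difference from the suggestion above: assigning into a dict keyed by identity keeps the last content seen for a key, while `dict.fromkeys` keeps the first. Which one is preferable depends on whether later extractions in a batch should win.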

Comment thread src/agents/judge.py
Comment on lines +536 to +563
    async def _deterministic_snippet(
        self, new_items: list, user_id: str,
    ) -> JudgeResult:
        operations: list[Operation] = []
        for item in new_items:
            content = str(item)
            fields = snippet_fields_from_storage_content(content)
            match = await self._lookup_metadata_match({
                "user_id": user_id,
                "domain": JudgeDomain.SNIPPET.value,
                "snippet_hash": snippet_identity_hash(fields),
            })

            if match is None:
                operations.append(Operation(
                    type=OperationType.ADD,
                    content=content,
                    reason="No snippet with the same normalized code/content identity.",
                ))
            else:
                operations.append(Operation(
                    type=OperationType.NOOP,
                    content=content,
                    embedding_id=match.id,
                    reason="Same snippet was already stored for this user.",
                ))

        return JudgeResult(operations=operations, confidence=1.0)

medium

Similar to _deterministic_code, the _deterministic_snippet method should deduplicate new_items by their identity hash and parallelize the metadata lookups using asyncio.gather to improve performance and prevent duplicate operations.

    async def _deterministic_snippet(
        self, new_items: list, user_id: str,
    ) -> JudgeResult:
        # Deduplicate items by snippet hash to prevent redundant operations
        unique_items: dict[str, tuple[str, dict]] = {}
        for item in new_items:
            content = str(item)
            fields = snippet_fields_from_storage_content(content)
            h = snippet_identity_hash(fields)
            unique_items[h] = (content, fields)

        async def _process_one(content: str, fields: dict) -> Operation:
            match = await self._lookup_metadata_match({
                "user_id": user_id,
                "domain": JudgeDomain.SNIPPET.value,
                "snippet_hash": snippet_identity_hash(fields),
            })

            if match is None:
                return Operation(
                    type=OperationType.ADD,
                    content=content,
                    reason="No snippet with the same normalized code/content identity.",
                )
            else:
                return Operation(
                    type=OperationType.NOOP,
                    content=content,
                    embedding_id=match.id,
                    reason="Same snippet was already stored for this user.",
                )

        tasks = [_process_one(c, f) for c, f in unique_items.values()]
        operations = await asyncio.gather(*tasks)
        return JudgeResult(operations=list(operations), confidence=1.0)

@hunterbastian
Contributor Author

Pushed a follow-up in 0345c53 addressing the Gemini review notes.

What changed:

  • deduplicates incoming code annotations by deterministic annotation key before lookup
  • deduplicates incoming snippets by normalized snippet hash before lookup
  • runs the resulting metadata lookups concurrently with asyncio.gather
  • added regression tests for duplicate code/snippet extractions in one deterministic judge batch

Verification rerun locally:

  • python3 -m compileall src/agents/judge.py tests/test_deterministic_memory_layer.py
  • git diff --check
  • uv run --extra dev pytest tests/test_deterministic_memory_layer.py tests/unit/test_schemas.py -> 14 passed
  • uv run --extra dev pytest -> 46 passed
  • uv run ruff check --select F401 src/agents/judge.py tests/test_deterministic_memory_layer.py

Collaborator

@Ankit-Kotnala Ankit-Kotnala left a comment

@hunterbastian thanks for the update. The overall direction looks good, especially moving code/snippet judging to deterministic metadata lookups and addressing the dedupe/parallel lookup feedback.

I’d like to hold off on merging until two identity issues are fixed:

  1. code_annotation_identity_key() should include both target_file and target_symbol when present. Right now it uses target_symbol over target_file, so the same symbol name in different files within a repo can collide.

  2. snippet_identity_hash() should preserve code identity more strictly. Since stable_hash() lowercases and collapses whitespace, two different code snippets can be treated as the same snippet. For code snippets, we should hash the normalized code text without lowercasing/collapsing internal whitespace.

Please add regression tests for both cases and rerun the full CI once the workflow is approved.
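
The first collision is easy to reproduce in miniature. In the sketch below, the "symbol wins over file" behaviour is taken from the comment above; the function names, the `|` delimiter, and the field order are hypothetical and not the actual format produced by `code_annotation_identity_key()`.

```python
from typing import Optional

def key_symbol_over_file(
    repo: str, target_file: Optional[str], target_symbol: Optional[str], kind: str
) -> str:
    """Collision-prone shape: the symbol replaces the file whenever it is present."""
    target = target_symbol or target_file or ""
    return f"{repo}|{target}|{kind}"

def key_file_and_symbol(
    repo: str, target_file: Optional[str], target_symbol: Optional[str], kind: str
) -> str:
    """Requested shape: both file and symbol contribute to the identity."""
    return "|".join([repo, target_file or "", target_symbol or "", kind])

# Two annotations on functions named `parse` that live in different files.
a = ("xmem", "src/a.py", "parse", "bug")
b = ("xmem", "src/b.py", "parse", "bug")

old_collides = key_symbol_over_file(*a) == key_symbol_over_file(*b)
new_collides = key_file_and_symbol(*a) == key_file_and_symbol(*b)
```

With the first shape, the second annotation would be judged against the first one's stored record and could overwrite it through an UPDATE.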

@hunterbastian
Contributor Author

Pushed follow-up in f51930f for the requested identity fixes.

What changed:

  • code_annotation_identity_key() now includes both target_file and target_symbol, so same-symbol annotations in different files no longer collide.
  • snippet_identity_hash() now hashes normalized code text without lowercasing or collapsing internal whitespace when a code snippet is present.
  • Added regression coverage for both collision cases and updated the deterministic weaver expectation to the new annotation key shape.

Verification rerun locally:

  • .venv/bin/python -m pytest tests/unit/test_schemas.py -q -> 6 passed
  • .venv/bin/python -m pytest tests/test_deterministic_memory_layer.py -q -> 10 passed
  • .venv/bin/python -m pytest -q -> 48 passed
  • git diff --check -> passed

The PR Labeler check is also green on the new head.
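
Why internal whitespace matters for snippet identity can be shown with two Python fragments that differ only in indentation. This is a rough sketch: `lossy_hash` imitates the lowercase-and-collapse behaviour attributed to `stable_hash()` in the review, and `strict_code_hash` is a hypothetical stricter variant, not the code in f51930f.

```python
import hashlib
import re

def lossy_hash(text: str) -> str:
    """Lowercase and collapse all whitespace runs before hashing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def strict_code_hash(code: str) -> str:
    """Normalize only line endings and trailing whitespace; keep case and indentation."""
    lines = code.replace("\r\n", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).strip("\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Same tokens, different behaviour: print runs on every iteration vs. once after the loop.
inside_loop = "for x in xs:\n    total += x\n    print(total)\n"
after_loop = "for x in xs:\n    total += x\nprint(total)\n"

lossy_same = lossy_hash(inside_loop) == lossy_hash(after_loop)
strict_same = strict_code_hash(inside_loop) == strict_code_hash(after_loop)
# Line-ending differences alone should still map to the same identity.
crlf_same = strict_code_hash(inside_loop) == strict_code_hash(inside_loop.replace("\n", "\r\n"))
```

The stricter hash still treats CRLF and LF copies of one snippet as identical, which is the kind of normalization a cross-session duplicate check needs.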

@ishaanxgupta
Member

@hunterbastian Please start a discussion on issue #141 so that we can agree on the approach first; then you can implement it in the PR.

@greptile-apps

greptile-apps Bot commented May 23, 2026

Greptile Summary

This PR introduces deterministic identity metadata for code annotations and personal snippets, letting XMem skip LLM judgment calls when re-encountering the same code or snippet across sessions. Parsing, hashing, and Pinecone metadata construction are centralized in src/schemas/code.py, the judge routes CODE and SNIPPET domains through new deterministic paths, and the weaver uses the shared helpers instead of inline dicts.

  • snippet_identity_hash uses the normalized code text (case-preserved, trailing-whitespace stripped) as the primary identity signal, falling back to a case-insensitive description+language hash when no code is present — correctly handling cross-session duplicates like the \ -vs-\\\ encoding difference between storage and ingestion.
  • code_annotation_content_hash identifies annotation changes by hashing identity key + severity + content; currently uses the case-insensitive stable_hash, which would suppress an UPDATE when only the annotation body casing changes.
  • ingest.py creates ephemeral JudgeAgent instances scoped to the right vector store for each domain, and correctly binds weaver.snippet_vector_store before both the judge and the weaver execute.

Confidence Score: 3/5

The new deterministic paths work correctly for the happy path, but a case-only change to a code annotation's body text would be silently swallowed as a no-op rather than written as an update.

The code_annotation_content_hash function hashes the annotation content through stable_hash, which collapses casing. Any annotation whose body changes purely in letter-case (e.g. correcting identifier casing in free-text, changing 'NULL' to 'null') will produce the identical hash before and after the change. The deterministic judge will classify the incoming item as NOOP and the stored annotation will remain stale. All other logic — the snippet hash path, the ingest wiring, the weaver metadata helpers, and the test coverage — looks correct.

src/schemas/code.py — specifically code_annotation_content_hash and its use of stable_hash for the content field.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/schemas/code.py | Adds deterministic identity helpers (parse, hash, metadata builders) for code annotations and snippets; code_annotation_content_hash uses the case-insensitive stable_hash, which can silently suppress updates when only annotation body casing changes. |
| src/agents/judge.py | Routes CODE and SNIPPET domains through new deterministic paths; _lookup_metadata_match correctly uses asyncio.to_thread matching the existing profile pattern, but has no fallback when search_by_metadata is unavailable. |
| src/pipelines/ingest.py | Replaces LLM judge calls with ephemeral deterministic JudgeAgent instances for code and snippet domains; correctly binds snippet_vector_store before both the judge and the weaver now. |
| src/pipelines/weaver.py | Replaces inline metadata dicts and local parser functions with schema helpers; embedding text for snippets is now richer (description + language + tags) instead of bare description. |
| tests/test_deterministic_memory_layer.py | Adds four integration tests covering cross-session snippet dedup, batch dedup, metadata persistence, and code annotation update detection; FakeVectorStore correctly models metadata equality search. |
| tests/unit/test_schemas.py | Adds unit tests for all new schema helpers; verifies hash stability, identity key format, and metadata field values. |

Sequence Diagram

sequenceDiagram
    participant Ingest as IngestPipeline
    participant Judge as JudgeAgent (ephemeral)
    participant Schema as schemas/code.py
    participant Store as VectorStore (Pinecone)
    participant Weaver as Weaver

    Ingest->>Judge: "arun_deterministic({domain: CODE/SNIPPET, new_items, user_id})"
    Judge->>Schema: code_annotation_fields_from_storage_content(content)
    Schema-->>Judge: fields dict
    Judge->>Schema: code_annotation_identity_key(fields) / snippet_identity_hash(fields)
    Schema-->>Judge: identity key / hash
    Judge->>Store: "search_by_metadata({user_id, domain, annotation_key/snippet_hash})"
    Store-->>Judge: SearchResult or None
    alt No match found
        Judge-->>Ingest: Operation(ADD)
    else Match found, hash unchanged
        Judge-->>Ingest: Operation(NOOP)
    else Match found, hash changed (code only)
        Judge-->>Ingest: Operation(UPDATE)
    end
    Ingest->>Weaver: execute(judge_result, domain, user_id)
    Weaver->>Schema: code_annotation_pinecone_metadata / snippet_pinecone_metadata
    Schema-->>Weaver: metadata dict (with annotation_key/annotation_hash or snippet_hash)
    Weaver->>Store: add / update with enriched metadata

Reviews (1): Last reviewed commit: "Fix deterministic code identity collisio..." | Re-trigger Greptile

Comment thread src/schemas/code.py
Comment on lines +427 to +432
    def code_annotation_content_hash(fields: dict[str, Any]) -> str:
        return stable_hash(
            code_annotation_identity_key(fields),
            fields.get("severity"),
            fields.get("content"),
        )

P1 code_annotation_content_hash calls stable_hash, which runs every part through normalize_lookup_text (lowercase + collapse-whitespace). A case-only change to an annotation body (e.g. correcting Auth.login to auth.login in the free-text content, or capitalizing a variable name) produces the same hash as the original, so the deterministic judge returns NOOP and silently discards the update. Code identifiers in annotation text can be case-significant.

Suggested change

     def code_annotation_content_hash(fields: dict[str, Any]) -> str:
    -    return stable_hash(
    +    return strict_hash(
             code_annotation_identity_key(fields),
             fields.get("severity"),
             fields.get("content"),
         )
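
The effect on the judge's decision can be sketched as follows. The `content_hash` helper and its `case_sensitive` flag are hypothetical, used here only to contrast the two behaviours; they are not the `stable_hash` or `strict_hash` implementations from the PR.

```python
import hashlib

def content_hash(*parts: object, case_sensitive: bool) -> str:
    """Hypothetical content hash; joins parts with a separator that is unlikely to occur in text."""
    joined = "\x1f".join("" if p is None else str(p) for p in parts)
    if not case_sensitive:
        joined = joined.lower()
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def decide(existing_hash: str, incoming_hash: str) -> str:
    """Mirror of the judge's branch once a matching annotation key has been found."""
    return "NOOP" if existing_hash == incoming_hash else "UPDATE"

key = "xmem|src/auth.py|login|bug"  # illustrative identity key, format assumed
old_body = "Auth.login skips the token check"
new_body = "auth.login skips the token check"  # case-only correction

insensitive = decide(
    content_hash(key, "high", old_body, case_sensitive=False),
    content_hash(key, "high", new_body, case_sensitive=False),
)
sensitive = decide(
    content_hash(key, "high", old_body, case_sensitive=True),
    content_hash(key, "high", new_body, case_sensitive=True),
)
```

One consequence worth keeping in mind: if records already stored carry an annotation_hash computed with the old function, switching hash functions would likely make each of them compare as changed once, producing a one-time UPDATE per annotation.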

Comment thread src/agents/judge.py
Comment on lines +576 to +585
    async def _lookup_metadata_match(
        self, filters: Dict[str, Any],
    ) -> Optional[SearchResult]:
        if not self.vector_store:
            return None
        search_fn = getattr(self.vector_store, "search_by_metadata", None)
        if search_fn is None:
            return None
        results = await asyncio.to_thread(search_fn, filters=filters, top_k=1)
        return _first_match(results or [])

P2 No fallback when search_by_metadata is absent

_lookup_metadata_match silently returns None when the injected vector store lacks a search_by_metadata method. Every incoming CODE or SNIPPET item then resolves to OperationType.ADD, so the same snippet or code annotation will be re-inserted on every session rather than deduped. The existing _fetch_similar_profile_metadata falls back to a semantic search in this case; applying the same fallback (or at least a warning) here would keep the two deterministic paths consistent and avoid silent duplicate growth.
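
The lighter of the two options mentioned above, a warning rather than a semantic fallback, might look like the sketch below. Both store classes and the logger name are hypothetical test doubles; only the `search_by_metadata(filters=..., top_k=...)` call shape is taken from the quoted code.

```python
import asyncio
import logging
from typing import Any, Optional

logger = logging.getLogger("judge_sketch")

class StoreWithMetadataSearch:
    """Hypothetical in-memory store that supports exact-match metadata filters."""
    def __init__(self, records: list[dict[str, Any]]) -> None:
        self.records = records

    def search_by_metadata(self, filters: dict[str, Any], top_k: int = 1) -> list[dict[str, Any]]:
        hits = [r for r in self.records if all(r.get(k) == v for k, v in filters.items())]
        return hits[:top_k]

class StoreWithoutMetadataSearch:
    """Hypothetical store that lacks search_by_metadata entirely."""

async def lookup_metadata_match(store: Any, filters: dict[str, Any]) -> Optional[dict[str, Any]]:
    search_fn = getattr(store, "search_by_metadata", None)
    if search_fn is None:
        # Make the degraded mode visible: every item will be judged as ADD from here on.
        logger.warning(
            "%s has no search_by_metadata; deterministic dedupe is disabled",
            type(store).__name__,
        )
        return None
    # Run the blocking store call in a worker thread, as the quoted code does.
    results = await asyncio.to_thread(search_fn, filters=filters, top_k=1)
    return results[0] if results else None

records = [{"id": "s1", "user_id": "u1", "snippet_hash": "abc"}]
hit = asyncio.run(
    lookup_metadata_match(StoreWithMetadataSearch(records), {"user_id": "u1", "snippet_hash": "abc"})
)
miss = asyncio.run(lookup_metadata_match(StoreWithoutMetadataSearch(), {"user_id": "u1"}))
```

A warning does not prevent duplicates; it only surfaces the condition so that a misconfigured store is noticed before duplicate records accumulate.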
